What is Seaborn?
Seaborn is a Python data visualization library built on top of Matplotlib. Displaying data in a visual context helps us understand it and uncover correlations between variables or trends that might not be obvious initially. Seaborn offers a high-level interface, compared to the low-level interface of Matplotlib.
Why should you use Seaborn versus matplotlib?
Seaborn makes our charts and plots look engaging and covers some common data visualization needs (like mapping color to a variable) in a single call. Basically, it makes data visualization and exploration easier.
There are essentially a couple of (big) limitations in matplotlib that Seaborn addresses:
- Seaborn comes with a large number of high-level interfaces and customized themes, whereas with matplotlib it is not easy to figure out the settings that make plots attractive
- Matplotlib functions are less convenient to use with dataframes, whereas seaborn functions accept a dataframe and column names directly
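As a small illustration of the points above, the sketch below maps colour to a DataFrame column with a single `hue` argument. The DataFrame `demo` and its column names are invented for this example and are not part of the HR dataset used later.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not the HR dataset
demo = pd.DataFrame({
    "age": [25, 32, 41, 29, 50, 38],
    "income": [2500, 4100, 6000, 3000, 9000, 5200],
    "left_company": ["Yes", "No", "No", "Yes", "No", "No"],
})

# seaborn reads the columns by name, colours the points by category
# and builds the legend in the same call
ax = sns.scatterplot(data=demo, x="age", y="income", hue="left_company")

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(ax.get_xlabel(), ax.get_ylabel())
print(legend_labels)
```

With plain matplotlib the same picture would typically need one plotting call per category plus manual legend handling.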
Setting up the Environment
Seaborn depends on NumPy, pandas and Matplotlib; pip installs any of them that are missing. To install the latest release of seaborn, you can use pip:
!pip install seaborn
pip install seaborn
Requirement already satisfied: seaborn in c:\users\user\anaconda3\lib\site-packages (0.13.2)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in c:\users\user\anaconda3\lib\site-packages (from seaborn) (1.26.4)
Requirement already satisfied: pandas>=1.2 in c:\users\user\anaconda3\lib\site-packages (from seaborn) (2.2.2)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in c:\users\user\anaconda3\lib\site-packages (from seaborn) (3.9.2)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.2.0)
Requirement already satisfied: cycler>=0.10 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.4)
Requirement already satisfied: packaging>=20.0 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.1)
Requirement already satisfied: pillow>=8 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in c:\users\user\anaconda3\lib\site-packages (from pandas>=1.2->seaborn) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in c:\users\user\anaconda3\lib\site-packages (from pandas>=1.2->seaborn) (2023.3)
Requirement already satisfied: six>=1.5 in c:\users\user\anaconda3\lib\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
pd.__version__
'2.2.2'
Datasets Used for Data Visualization
We’ll be working primarily with one dataset, HR_Employee_Attrition_Data.csv.
Preparing the data
# importing the dataset
df_HR = pd.read_csv(r'C:\Users\User\OneDrive\Documents\AWP Module\HR_Employee_Attrition_Data.csv')
df_HR.head()
| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 3 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 4 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 5 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
df_HR.columns
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
'YearsWithCurrManager'],
dtype='object')
pd.set_option('display.max_columns',None)
df_HR.head()
| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 3 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 4 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 5 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
df_HR.shape
(2940, 35)
df_HR_num = df_HR.select_dtypes(include = 'number')
df_HR_cat = df_HR.select_dtypes(include = 'object')
print(df_HR_num.columns)
print(df_HR_cat.columns)
Index(['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount',
'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate',
'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome',
'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike',
'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
'YearsSinceLastPromotion', 'YearsWithCurrManager'],
dtype='object')
Index(['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender',
'JobRole', 'MaritalStatus', 'Over18', 'OverTime'],
dtype='object')
for x in df_HR_num.columns:
    if df_HR_num[x].nunique() < 5:
        print(f'Column {x} has {df_HR_num[x].nunique()} values')
Column EmployeeCount has 1 values
Column EnvironmentSatisfaction has 4 values
Column JobInvolvement has 4 values
Column JobSatisfaction has 4 values
Column PerformanceRating has 2 values
Column RelationshipSatisfaction has 4 values
Column StandardHours has 1 values
Column StockOptionLevel has 4 values
Column WorkLifeBalance has 4 values
df_HR_num.drop(['EmployeeCount', 'StandardHours'], inplace = True, axis = 1)
df_HR.columns
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
'YearsWithCurrManager'],
dtype='object')
df_HR_num.columns
Index(['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeNumber',
'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel',
'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
'YearsSinceLastPromotion', 'YearsWithCurrManager'],
dtype='object')
for x in df_HR_cat.columns:
    if df_HR_cat[x].nunique() < 5:
        print(f'Column {x} has {df_HR_cat[x].nunique()} values')
Column Attrition has 2 values
Column BusinessTravel has 3 values
Column Department has 3 values
Column Gender has 2 values
Column MaritalStatus has 3 values
Column Over18 has 1 values
Column OverTime has 2 values
df_HR_cat.drop('Over18', inplace = True, axis = 1)
df_HR_cat.columns
Index(['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender',
'JobRole', 'MaritalStatus', 'OverTime'],
dtype='object')
df_HR_num.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Age                       2940 non-null   int64
 1   DailyRate                 2940 non-null   int64
 2   DistanceFromHome          2940 non-null   int64
 3   Education                 2940 non-null   int64
 4   EmployeeNumber            2940 non-null   int64
 5   EnvironmentSatisfaction   2940 non-null   int64
 6   HourlyRate                2940 non-null   int64
 7   JobInvolvement            2940 non-null   int64
 8   JobLevel                  2940 non-null   int64
 9   JobSatisfaction           2940 non-null   int64
 10  MonthlyIncome             2940 non-null   int64
 11  MonthlyRate               2940 non-null   int64
 12  NumCompaniesWorked        2940 non-null   int64
 13  PercentSalaryHike         2940 non-null   int64
 14  PerformanceRating         2940 non-null   int64
 15  RelationshipSatisfaction  2940 non-null   int64
 16  StockOptionLevel          2940 non-null   int64
 17  TotalWorkingYears         2940 non-null   int64
 18  TrainingTimesLastYear     2940 non-null   int64
 19  WorkLifeBalance           2940 non-null   int64
 20  YearsAtCompany            2940 non-null   int64
 21  YearsInCurrentRole        2940 non-null   int64
 22  YearsSinceLastPromotion   2940 non-null   int64
 23  YearsWithCurrManager      2940 non-null   int64
dtypes: int64(24)
memory usage: 551.4 KB
df_HR_cat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Attrition       2940 non-null   object
 1   BusinessTravel  2940 non-null   object
 2   Department      2940 non-null   object
 3   EducationField  2940 non-null   object
 4   Gender          2940 non-null   object
 5   JobRole         2940 non-null   object
 6   MaritalStatus   2940 non-null   object
 7   OverTime        2940 non-null   object
dtypes: object(8)
memory usage: 183.9+ KB
# pd.set_option('display.max_columns', None)
df_HR_num.describe()
| | Age | DailyRate | DistanceFromHome | Education | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | MonthlyRate | NumCompaniesWorked | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 |
| mean | 36.923810 | 802.485714 | 9.192517 | 2.912925 | 1470.500000 | 2.721769 | 65.891156 | 2.729932 | 2.063946 | 2.728571 | 6502.931293 | 14313.103401 | 2.693197 | 15.209524 | 3.153741 | 2.712245 | 0.793878 | 11.279592 | 2.799320 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.133819 | 403.440447 | 8.105485 | 1.023991 | 848.849221 | 1.092896 | 20.325969 | 0.711440 | 1.106752 | 1.102658 | 4707.155770 | 7116.575021 | 2.497584 | 3.659315 | 0.360762 | 1.081025 | 0.851932 | 7.779458 | 1.289051 | 0.706356 | 6.125483 | 3.622521 | 3.221882 | 3.567529 |
| min | 18.000000 | 102.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 30.000000 | 1.000000 | 1.000000 | 1.000000 | 1009.000000 | 2094.000000 | 0.000000 | 11.000000 | 3.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 30.000000 | 465.000000 | 2.000000 | 2.000000 | 735.750000 | 2.000000 | 48.000000 | 2.000000 | 1.000000 | 2.000000 | 2911.000000 | 8045.000000 | 1.000000 | 12.000000 | 3.000000 | 2.000000 | 0.000000 | 6.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 2.000000 |
| 50% | 36.000000 | 802.000000 | 7.000000 | 3.000000 | 1470.500000 | 3.000000 | 66.000000 | 3.000000 | 2.000000 | 3.000000 | 4919.000000 | 14235.500000 | 2.000000 | 14.000000 | 3.000000 | 3.000000 | 1.000000 | 10.000000 | 3.000000 | 3.000000 | 5.000000 | 3.000000 | 1.000000 | 3.000000 |
| 75% | 43.000000 | 1157.000000 | 14.000000 | 4.000000 | 2205.250000 | 4.000000 | 84.000000 | 3.000000 | 3.000000 | 4.000000 | 8380.000000 | 20462.000000 | 4.000000 | 18.000000 | 3.000000 | 4.000000 | 1.000000 | 15.000000 | 3.000000 | 3.000000 | 9.000000 | 7.000000 | 3.000000 | 7.000000 |
| max | 60.000000 | 1499.000000 | 29.000000 | 5.000000 | 2940.000000 | 4.000000 | 100.000000 | 4.000000 | 5.000000 | 4.000000 | 19999.000000 | 26999.000000 | 9.000000 | 25.000000 | 4.000000 | 4.000000 | 3.000000 | 40.000000 | 6.000000 | 4.000000 | 40.000000 | 18.000000 | 15.000000 | 17.000000 |
df_HR_cat.describe()
| | Attrition | BusinessTravel | Department | EducationField | Gender | JobRole | MaritalStatus | OverTime |
|---|---|---|---|---|---|---|---|---|
| count | 2940 | 2940 | 2940 | 2940 | 2940 | 2940 | 2940 | 2940 |
| unique | 2 | 3 | 3 | 6 | 2 | 9 | 3 | 2 |
| top | No | Travel_Rarely | Research & Development | Life Sciences | Male | Sales Executive | Married | No |
| freq | 2466 | 2086 | 1922 | 1212 | 1764 | 652 | 1346 | 2108 |
Note above the difference between the output of the describe() function on numerical and categorical data. For numerical data, the output contains statistics such as count, mean, std, min, max and percentiles. For categorical data, we get the count, the number of unique values, the most frequent value (top) and its frequency (freq).
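The difference can be reproduced on a tiny frame. This is only a sketch: the frame `toy` is made up for illustration, and `exclude="number"` is used here as one possible way to pick the non-numeric columns.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
})

num_summary = toy.select_dtypes(include="number").describe()
cat_summary = toy.select_dtypes(exclude="number").describe()

print(list(num_summary.index))  # statistics reported for numeric columns
print(list(cat_summary.index))  # statistics reported for categorical columns
print(cat_summary.loc["top", "Attrition"], cat_summary.loc["freq", "Attrition"])
```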
Data Visualization using Seaborn
This implementation section is divided into two categories:
● Visualizing statistical relationships
● Plotting categorical data
We’ll look at multiple examples of each category and how to plot it using seaborn.
Visualizing statistical relationships
Visualizing statistical relationships means understanding how the variables in a dataset relate to each other and how those relationships depend on other variables.
Scatterplot using Seaborn
A scatterplot is perhaps the most common way of visualizing the relationship between two variables. Each point represents one observation in the dataset, so the plot shows the joint distribution of two variables as a cloud of points. To draw the scatter plot, we’ll be using the relplot() function of the seaborn library. It is a figure-level function for visualizing statistical relationships. By default, relplot() produces a scatter plot:
df_HR_num.columns
Index(['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeNumber',
'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel',
'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
'YearsSinceLastPromotion', 'YearsWithCurrManager'],
dtype='object')
# relplot, catplot, displot, pairplot, jointplot
sns.relplot(x = 'Age', y = 'MonthlyIncome', data = df_HR_num, height = 6, aspect = 1.5)
plt.show()
# show hue = JobSatisfaction, size = JobLevel, style = PerformanceRating, palette
sns.relplot(x = 'Age', y = 'MonthlyIncome',hue='JobSatisfaction',size='JobLevel',style='PerformanceRating',data = df_HR_num, height = 6, aspect = 1.5,palette='bright')
plt.show()
sns.relplot(x = 'Age', y = 'MonthlyIncome',hue='JobSatisfaction',size='PerformanceRating',data = df_HR_num, height = 6, aspect = 1.5,palette='mako')
plt.show()
df_HR_num['PerformanceRating'].value_counts()
PerformanceRating
3    2488
4     452
Name: count, dtype: int64
Note above how we used the height parameter to set the size of the plot instead of plt.figure(figsize=...). height is the height of each facet in inches and aspect is the ratio of width to height, i.e. a height of 6 with an aspect ratio of 1.5 gives a width of 9.
Figure-level functions in seaborn (such as relplot, catplot, lmplot and displot) take height and aspect as parameters for setting the dimensions of the plot; axes-level functions such as scatterplot do not.
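The relationship between height, aspect and the final figure size can be checked directly. The sketch below uses a made-up frame `toy` and assumes a single facet with no legend; when a legend is drawn outside the axes, seaborn may widen the figure to make room for it.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not the HR dataset
toy = pd.DataFrame({
    "Age": [25, 32, 41, 29, 50],
    "MonthlyIncome": [2500, 4100, 6000, 3000, 9000],
})

g = sns.relplot(data=toy, x="Age", y="MonthlyIncome", height=6, aspect=1.5)

# Figure size in inches: width should equal height * aspect
width, height = g.figure.get_size_inches()
print(width, height)
```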
df_HR_num.shape
(2940, 24)
df_HR_cat.shape
(2940, 8)
# show size hue,style ,palette
Here we have also specified the palette of colors to be used. More on Seaborns palettes can be found here :
https://seaborn.pydata.org/tutorial/color_palettes.html
Finally, the style parameter splits the points further by marker symbol; in the second plot above it is mapped to 'PerformanceRating'.
Seaborn has various parameters for each plot. For example, the order of the levels of the style variable can be set with style_order, and the symbols themselves with markers, which accepts matplotlib marker codes. As you go through your projects and evolve as a Data Scientist, feel free to experiment with and research these further parameters. You will eventually build your preferred options and parameters and mostly re-use these. However, it is always good to know of the other available options.
We can use the sns.palplot() function to view the color palettes linked above.
palettePastel = sns.color_palette('pastel')
paletteDeep = sns.color_palette('deep')
paletteSet2 = sns.color_palette('Set2')
paletteMako = sns.color_palette('mako')
paletteMakoSeq = sns.color_palette("mako", as_cmap=True)
sns.palplot(palettePastel)
sns.palplot(paletteDeep)
sns.palplot(paletteSet2)
sns.palplot(paletteMako)
#sns.palplot(paletteMakoSeq)
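To see what these palette objects contain without drawing them, a short sketch: `color_palette()` returns a list-like sequence of RGB tuples, while `as_cmap=True` returns a matplotlib colormap object instead of a list of colours.

```python
import seaborn as sns

pastel = sns.color_palette("pastel")                  # list-like sequence of RGB tuples
mako_sample = sns.color_palette("mako")               # discrete sample from a continuous colormap
mako_cmap = sns.color_palette("mako", as_cmap=True)   # matplotlib colormap object

print(len(pastel), len(mako_sample))
print(pastel.as_hex()[:3])        # hex codes of the first three colours
print(type(mako_cmap).__name__)
```

Passing `n_colors` (for example `sns.color_palette("mako", n_colors=4)`) changes how many colours are sampled.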
sns.relplot(data=df_HR_num,x='Age',y='MonthlyIncome',height=8,aspect=1.5,kind='line',errorbar=None)
<seaborn.axisgrid.FacetGrid at 0x1cfc0c12d80>
plt.figure(figsize=(10,7))
sns.scatterplot(data=df_HR.head(100),x='Age',y='MonthlyIncome',hue='Attrition',palette='bright',\
size='JobLevel',style='JobSatisfaction')
plt.show()
plt.figure(figsize=(10,10))
sns.scatterplot(x = 'Age', y = 'MonthlyIncome',hue='JobSatisfaction',size='PerformanceRating',data = df_HR_num,palette='mako')
plt.show()
If we want to use sns.scatterplot() we have to set the figure size with plt.figure(figsize=(a, b))
# sns.relplot(data,x,y,kind='line'/'scatter')
# sns.scatterplot(data,x,y)
plt.figure(figsize = (8,8))
sns.lineplot(data = df_HR, x = 'Age', y = 'YearsAtCompany',errorbar=None)
plt.show()
sns.relplot()
- sns.lineplot = sns.relplot(kind = 'line')
- sns.scatterplot = sns.relplot(kind = 'scatter') (DEFAULT Relplot)
import warnings
warnings.filterwarnings('ignore')
sns.relplot(data = df_HR, x = 'Age', y = 'YearsAtCompany', height = 8, aspect = 1.5, kind = 'line',errorbar=None,hue='Attrition')
plt.show()
# Lineplot age vs years at company
sns.relplot(data=df_HR,x='Age',y='YearsAtCompany',kind='line',aspect=1.2,height=8,errorbar=None)
<seaborn.axisgrid.FacetGrid at 0x1cfc97b9790>
sns.relplot(data=df_HR,x='Age',y='PercentSalaryHike',kind='line',aspect=1.5,height=8,errorbar=None,hue='Attrition')
<seaborn.axisgrid.FacetGrid at 0x1cfcb711880>
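# show hue='Attrition'; errorbar=None hides the confidence band (older seaborn versions used the ci parameter)
By changing kind to 'line' we can use the sns.relplot() function to draw line plots. Seaborn also has scatterplot() and lineplot() functions to draw these same plots. The data parameters (x, y, hue, size, style, palette) remain the same, but height and aspect are only accepted by relplot().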
lmplot()
The lmplot() function in seaborn plots a scatterplot with a regression line overlaid.
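To make the idea of the regression line concrete, here is a sketch that fits a straight line by ordinary least squares with `np.polyfit`, which is the kind of fit lmplot() draws by default, and computes the residuals that sns.residplot() displays. The data are randomly generated for illustration and are not the HR dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(18, 60, size=200)                # hypothetical ages
y = 150.0 * x + rng.normal(0, 1000, size=200)    # hypothetical incomes with noise

slope, intercept = np.polyfit(x, y, deg=1)       # least-squares straight line
fitted = slope * x + intercept
residuals = y - fitted                           # positive above the line, negative below

print(round(slope, 1), round(intercept, 1))
print(abs(residuals.mean()) < 1e-6)              # residuals of such a fit average to zero
```

A residual plot that shows a clear pattern (for example a funnel or a curve) suggests that a straight line does not describe the relationship well.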
sns.lmplot(data=df_HR,x='Age',y='MonthlyIncome',aspect=1.2,height=8) # best fit line for linear Regression
<seaborn.axisgrid.FacetGrid at 0x1cfc96cd490>
sns.residplot(data=df_HR,x='Age',y='MonthlyIncome')
<Axes: xlabel='Age', ylabel='MonthlyIncome'>
sns.lmplot(data = df_HR, x = 'YearsAtCompany', y = 'MonthlyIncome', height = 6, aspect = 2)
<seaborn.axisgrid.FacetGrid at 0x1cfcc71a060>
plt.figure(figsize = (10,6))
sns.residplot(data = df_HR, x = 'YearsAtCompany', y = 'MonthlyIncome') # residuals are the errors of the linear fit. Points above the dotted line
# are positive residuals & points below it are negative residuals
<Axes: xlabel='YearsAtCompany', ylabel='MonthlyIncome'>
Plotting Categorical Data
In the above section, we saw how we can use different visual representations to show the relationship between multiple variables. We drew the plots between two numeric variables. In this section, we’ll see the relationship between two variables of which one is categorical (divided into different groups). We’ll be using the catplot() function of the seaborn library to draw the plots of categorical data. In older seaborn releases this function was called factorplot() and its default kind was a point plot; it was renamed to catplot() in version 0.9, where the default kind became a strip plot.
df_HR.head()
| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 3 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 4 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 5 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
catplot
sns.catplot(data=df_HR,y='MonthlyIncome',kind='box',hue='Attrition',col='Department')
<seaborn.axisgrid.FacetGrid at 0x1cfc9879a30>
sns.catplot(data=df_HR,x='MonthlyIncome',kind='box',y='JobRole',aspect=1.3,height=8)
<seaborn.axisgrid.FacetGrid at 0x1cfcde277d0>
sns.catplot(data=df_HR,x='MonthlyIncome',kind='box',y='JobRole',aspect=1.3,height=8,hue='Attrition',col='Gender')
<seaborn.axisgrid.FacetGrid at 0x1cfcc783c80>
# strip plot
sns.catplot(data=df_HR,y='DistanceFromHome',x='Attrition',jitter=False) # jitterplot or strip plot
<seaborn.axisgrid.FacetGrid at 0x1cfce55d490>
sns.catplot(data=df_HR,y='DistanceFromHome',x='JobSatisfaction',kind='violin',hue='Attrition',split=True,
inner='quartile',aspect=2,height=6,col='Department')
<seaborn.axisgrid.FacetGrid at 0x1cfc97f8770>
df_HR_cat.columns
Index(['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender',
'JobRole', 'MaritalStatus', 'OverTime'],
dtype='object')
sns.catplot(x = 'Attrition', y = 'YearsAtCompany', data = df_HR) # striplot
plt.show()
The above is a strip plot (also called a jitter plot): every observation is drawn as a point at its x category and y value. By default the points are shifted by a small random amount along the categorical (x) axis, away from the true x position (jittering), so that they don't overlap completely. If we set jitter to False, all points are drawn exactly on the x tick, and observations with the same y value collapse into a single visible point.
sns.catplot(x = 'Attrition', y = 'YearsAtCompany', data = df_HR,jitter=False) # striplot
plt.show()
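The effect of jitter can also be inspected numerically by reading back the x positions of the drawn points. This is a sketch on a made-up frame `toy`; the helper `x_positions` is hypothetical and simply collects the point coordinates from the matplotlib collections that stripplot() adds to the axes.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: 20 observations in each of two categories
toy = pd.DataFrame({
    "Attrition": ["Yes"] * 20 + ["No"] * 20,
    "YearsAtCompany": list(range(20)) * 2,
})

def x_positions(jitter):
    """Return the distinct x coordinates of all points in a strip plot."""
    fig, ax = plt.subplots()
    sns.stripplot(data=toy, x="Attrition", y="YearsAtCompany", jitter=jitter, ax=ax)
    xs = [np.asarray(c.get_offsets())[:, 0] for c in ax.collections if len(c.get_offsets())]
    return np.unique(np.round(np.concatenate(xs), 6))

no_jitter = x_positions(False)
with_jitter = x_positions(True)

print(len(no_jitter))    # one x position per category
print(len(with_jitter))  # many different x positions
```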
#show jitter=True
The kind parameter in catplot takes the following values.
"strip" - the default, seen above. Direct func - sns.stripplot()
"swarm" - similar to jitter = True, but the points are positioned along the categorical axis so that they do not overlap at all. Direct func - sns.swarmplot()
"box" - Shows a statistical summary of the chosen y for each x tick. The box spans the IQR from the 25% to the 75% value, with a line at the 50% value (the median, not the mean). The whiskers extend to the most extreme points within 1.5 IQR of the box, and points beyond them are drawn as outliers. Direct func - sns.boxplot()
"violin" - Distribution of the y points for each x tick. Shows a box plot in the center of each violin. The distribution is mirrored on both sides of the center. Direct func - sns.violinplot()
"point" - Shows the point estimate (the mean by default) as a point for each x tick; the level of uncertainty around that estimate is shown by the lines above and below the point. Direct func - sns.pointplot()
"bar" - Bar plot of the point estimate (the mean by default). Direct func - sns.barplot()
"count" - Count plot showing the frequency of each x value. Direct func - sns.countplot()
From the documentation of catplot, we can see that each value of 'kind' also has its own plotting function, i.e.
instead of calling sns.catplot(data = df_HR, x = 'Attrition', y = 'YearsAtCompany', kind = 'swarm') we could call
sns.swarmplot(data = df_HR, x = 'Attrition', y = 'YearsAtCompany') with the same data parameters.
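To make the "box" description in the list above concrete, the sketch below computes the quantities a box plot displays for a small made-up sample, using the conventional 1.5 IQR rule for the whiskers.

```python
import numpy as np

# Hypothetical sample with one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that lie inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)          # 3.25 5.5 7.75 4.5
print(lower_fence, upper_fence)     # -3.5 14.5
print(inside.min(), inside.max())   # 1 9
print(outliers)                     # [50]
```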
sns.catplot(x = 'JobSatisfaction', y = 'DistanceFromHome', data = df_HR, kind = 'strip', height = 8, aspect = 1.5)
plt.show()
sns.catplot(data=df_HR,x='Attrition',y='Age',kind='strip',jitter=False,col='Gender') # strip plot and jitter plot are the same kind; jitter is just a parameter
<seaborn.axisgrid.FacetGrid at 0x1cfc1a17b30>
# take hue=attrition
sns.catplot(x = 'JobSatisfaction', y = 'DistanceFromHome', data = df_HR,hue='Attrition', kind = 'strip', height = 8, aspect = 1.5)
plt.show()
sns.catplot(x = 'MonthlyIncome', data = df_HR, kind = 'box', height = 8, aspect = 2)
plt.show()
sns.catplot(x = 'Age', data = df_HR, kind = 'box', height = 8, aspect = 2)
plt.show()
sns.catplot(y = 'Age', x='Department',data = df_HR, kind = 'box', height = 8, aspect = 2,hue='Attrition')
plt.show()
df_HR.columns
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
'YearsWithCurrManager'],
dtype='object')
Violin plots
Violin plots combine the box plot with a kernel density estimate to provide a richer description of the distribution of values. By default a small box plot is drawn inside each violin; with inner = 'quartile' the quartile values are drawn as lines instead.
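The outline of a violin is a kernel density estimate. As a rough sketch of what that means, the code below builds a Gaussian kernel density estimate by hand with NumPy on randomly generated data. The rule-of-thumb bandwidth used here is an assumption made for this illustration; seaborn's own bandwidth choice may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=10, scale=3, size=300)   # hypothetical distances

# Rule-of-thumb bandwidth (an assumption for this sketch)
bandwidth = sample.std(ddof=1) * len(sample) ** (-1 / 5)

grid = np.linspace(sample.min() - 3 * bandwidth, sample.max() + 3 * bandwidth, 400)

# The estimate is the average of one Gaussian bump per observation
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (len(sample) * bandwidth * np.sqrt(2 * np.pi))

# Trapezoid rule: the area under a density curve should be close to 1
area = (density[:-1] + density[1:]).sum() * (grid[1] - grid[0]) / 2

# The quartiles are what inner='quartile' marks inside the violin
q1, median, q3 = np.percentile(sample, [25, 50, 75])
print(round(area, 3), q1 < median < q3)
```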
sns.catplot(x = 'MonthlyIncome', data = df_HR, kind = 'violin', height = 6, aspect = 2)
plt.show()
# swarmplot,violinplot
sns.catplot(data=df_HR,x='JobSatisfaction',y='DistanceFromHome',kind='violin',aspect=1.2,hue='Attrition',split=True,inner='quartile',\
palette='muted')
<seaborn.axisgrid.FacetGrid at 0x1cfdfdc39b0>
sns.catplot(data=df_HR,x='JobSatisfaction',y='DistanceFromHome',kind='violin',aspect=1.2,hue='Attrition',split=True,inner='quartile',\
palette='muted',col='Gender')
<seaborn.axisgrid.FacetGrid at 0x1cfdfe5a300>
We can also overlay one plot over another as we did with matplotlib. For this we draw both axes-level plots on the same matplotlib figure.
plt.figure(figsize = (16,10))
# draw the violins first so that the swarm points stay visible on top of them
sns.violinplot(x = 'JobSatisfaction', y = 'DistanceFromHome', data = df_HR, palette = 'pastel')
sns.swarmplot(x = 'JobSatisfaction', y = 'DistanceFromHome', data = df_HR, palette = 'bright')
plt.show()
sns.catplot(data=df_HR,x='Department',kind='count',height=6,aspect=2,hue='Attrition',col='JobSatisfaction')
<seaborn.axisgrid.FacetGrid at 0x1cfc0e7df70>
sns.catplot(x = 'JobSatisfaction', y = 'DistanceFromHome', hue = 'Attrition', data = df_HR, kind = 'violin', height = 6,\
aspect = 2)
plt.show()
By setting the split parameter to true - we can get the distributions of Attrition - Yes, Attrition - No on both sides of the plot.
#split=True
sns.catplot(x = 'JobSatisfaction', y = 'DistanceFromHome', hue = 'Attrition', data = df_HR, kind = 'violin', height = 6,\
aspect = 2,split=True)
plt.show()
However, we are only getting one box plot in the center for the overall attrition. We can rectify this by adding the inner = 'quartile' parameter. Now it will show us the lines on each distribution showing the 25%, 50%(median) and 75% of the distribution.
# inner='quartile'
#split=True
sns.catplot(x = 'JobSatisfaction', y = 'DistanceFromHome', hue = 'Attrition', data = df_HR, kind = 'violin', height = 6,\
aspect = 2,split=True,inner='quartile')
plt.show()
Pointplot
A point plot shows a point estimate (the mean by default) of y for each x value and connects the estimates that belong to the same hue category. This helps in identifying how the relationship changes within a particular hue category.
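The values that a point plot draws can be reproduced with a groupby. The sketch below uses a small made-up frame `toy` with the same column names as the plot that follows: each cell of the result is one point, and each row (one hue level) is one connected line.

```python
import pandas as pd

# Hypothetical toy data, not the HR dataset
toy = pd.DataFrame({
    "PercentSalaryHike": [11, 11, 11, 11, 12, 12, 12, 12],
    "Attrition": ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No"],
    "JobSatisfaction": [1, 3, 3, 4, 2, 2, 4, 4],
})

# Mean of y for every (hue, x) pair
points = (
    toy.groupby(["Attrition", "PercentSalaryHike"])["JobSatisfaction"]
    .mean()
    .unstack()
)
print(points)
```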
sns.catplot(x = 'PercentSalaryHike', y = 'JobSatisfaction',hue = 'Attrition', data = df_HR, kind = 'point', height = 6,\
aspect = 1.5)
plt.show()
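The points in a point plot are, by default, the group means. The sketch below reproduces those numbers with a pandas groupby on a tiny hand-made table; the values are invented for illustration and are not taken from df_HR.

```python
import pandas as pd

# Tiny hand-made table (hypothetical values) with the same column names
toy = pd.DataFrame({
    "PercentSalaryHike": [11, 11, 11, 11, 12, 12, 12, 12],
    "Attrition":         ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
    "JobSatisfaction":   [1, 3, 3, 4, 2, 4, 2, 2],
})

# Mean JobSatisfaction per (PercentSalaryHike, Attrition) group:
# one row per x position, one column per hue level (one line each in the plot)
means = (
    toy.groupby(["PercentSalaryHike", "Attrition"])["JobSatisfaction"]
       .mean()
       .unstack()
)
print(means)
```

Reading across a column of `means` follows one connected line of the point plot.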
# countplot
sns.catplot(data=df_HR,x='Education',kind='count',hue='Attrition',col='Gender')
<seaborn.axisgrid.FacetGrid at 0x1cfc96ba570>
sns.catplot(data=df_HR,x='EnvironmentSatisfaction',y='TotalWorkingYears',kind='bar',hue='Attrition',errorbar=None)
<seaborn.axisgrid.FacetGrid at 0x1cfceee47a0>
sns.catplot(x = 'JobSatisfaction', y = 'DistanceFromHome', hue = 'Attrition', data = df_HR, kind = 'bar',errorbar=None) # errorbar=None
plt.show()
sns.catplot(x = 'JobSatisfaction', data = df_HR,hue='Attrition',kind = 'count',height = 6, aspect = 2)
plt.show()
sns.catplot(x = 'Attrition', data = df_HR, kind = 'count',height = 6, aspect = 0.5,col='JobSatisfaction')
plt.show()
Using seaborn, we can visualise higher-dimensional relationships as well with the 'col' parameter, which draws one subplot per category of the chosen column.
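As a rough illustration of how `col` adds a dimension, the sketch below builds a small synthetic table and checks how many subplots the resulting FacetGrid contains. The `toy` DataFrame is hypothetical; only the column names are borrowed from the HR data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script; drop this in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for df_HR with 80 rows (hypothetical values)
toy = pd.DataFrame({
    "Attrition": np.tile(["Yes", "No"], 40),
    "JobSatisfaction": np.repeat([1, 2, 3, 4], 20),
    "Gender": np.tile(np.repeat(["Female", "Male"], 2), 20),
})

# x encodes Attrition, hue encodes Gender, and col adds JobSatisfaction
# as a third variable: one subplot per satisfaction level
g = sns.catplot(data=toy, x="Attrition", kind="count", hue="Gender",
                col="JobSatisfaction", height=3, aspect=0.8)
print(g.axes.shape)  # one row of subplots, one column per JobSatisfaction level
```

A `row` parameter works the same way for a fourth variable, at the cost of smaller subplots.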
# hue = Attrition, col = JobSatisfaction
df_HR.head()
| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 3 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 4 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 5 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
Visualizing the Distribution of a Dataset¶
Univariate distributions - Histograms¶
sns.displot(data = df_HR, x = 'MonthlyIncome', height = 6, aspect = 1.5,kde=True)
<seaborn.axisgrid.FacetGrid at 0x1cfcf0d7ce0>
sns.displot(data = df_HR, x = 'MonthlyIncome', height = 6, aspect = 1.5,kde=True,col='JobSatisfaction')
<seaborn.axisgrid.FacetGrid at 0x1cfd7c67020>
As with relplot and catplot, displot has a kind parameter. It takes the following values:
hist - Data is divided into bins. This is the default. Direct function: histplot().
kde - Kernel density estimate, a smoothed estimate of the probability density of a variable. We can set kde = True in a histogram to get both the KDE curve and the histogram bars. Direct function: kdeplot().
ecdf - Empirical cumulative distribution function, which visualizes the data points cumulatively. Explanation below. Direct function: ecdfplot().
A rug plot is not a value of kind: it can be added to the other distribution plots with rug = True or drawn separately with rugplot(). It draws a tick along the x axis for each data point, so the density at different points in the data can be analyzed.
Please remember that the graphs shown here are for univariate distributions.
# Kernel density estimate (KDE)
sns.displot(data = df_HR, x = 'MonthlyIncome', height = 6, aspect = 2.5, kind = 'kde', rug = False)
<seaborn.axisgrid.FacetGrid at 0x1cfd5678a40>
sns.displot(data = df_HR, x = 'MonthlyIncome', kind = 'hist', kde=True)
<seaborn.axisgrid.FacetGrid at 0x1cfca75a750>
sns.displot(data = df_HR, x = 'MonthlyIncome', kind = 'kde', rug = True,col='Gender')
<seaborn.axisgrid.FacetGrid at 0x1cfd7c64200>
plt.figure(figsize = (12,8))
sns.rugplot(data = df_HR, x = 'MonthlyIncome', height = 1)
plt.show()
sns.rugplot(data = df_HR, x = 'MonthlyIncome', height = 1) # rug ===> density
<Axes: xlabel='MonthlyIncome'>
# displot is a figure-level function and creates its own figure, so plt.figure() is not needed here
sns.displot(data = df_HR, x = 'MonthlyIncome',kind = 'ecdf')
sns.displot(data = df_HR, x = 'MonthlyIncome', kind = 'kde')
plt.show()
sns.displot(data = df_HR, x = 'MonthlyIncome', height = 6, aspect = 1.5, kind = 'ecdf', col = 'Attrition')
<seaborn.axisgrid.FacetGrid at 0x1cfdd6b6a80>
ECDF stands for Empirical Cumulative Distribution Function. The ECDF plot visualizes every data point of the dataset directly, in a cumulative manner.
Unlike a histogram or a KDE, this plot has no bin size or smoothing parameter to choose, so its shape does not depend on such settings.
Because its curves are monotonically increasing, it is well suited for comparing multiple distributions at the same time. In an ECDF plot, the x-axis corresponds to the range of values for the variable, whereas the y-axis corresponds to the proportion of data points that are less than or equal to the corresponding value on the x-axis.
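The definition above translates into a few lines of NumPy. The helper name `ecdf` is introduced here only for illustration, and the four input values are made up.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and, for each, the proportion of data <= that value."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Four hypothetical monthly incomes
x, y = ecdf([3000, 1000, 2000, 4000])
print(x)  # sorted values
print(y)  # cumulative proportions, ending at 1.0
```

Plotting `y` against `x` as a step line gives the same kind of curve that kind = 'ecdf' draws.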
Plotting Bivariate Distributions¶
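Apart from visualizing the distribution of a single variable, we can see how two variables are distributed with respect to each other. A bivariate distribution is the joint distribution of two variables; to visualize it, we use the jointplot() function of the seaborn library, which draws the joint plot in the center and the univariate distribution of each variable along the top and right edges. By default, jointplot draws a scatter plot.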
df_HR.shape
(2940, 35)
sns.jointplot(data = df_HR, x = 'YearsAtCompany', y = 'YearsSinceLastPromotion', height = 8, ratio = 3)
plt.show()
sns.jointplot(data = df_HR, x = 'YearsAtCompany', y = 'YearsSinceLastPromotion', height = 8, ratio = 3, \
hue = 'JobSatisfaction', palette = 'bright')
plt.show()
The kind parameter of jointplot takes the following values:
scatter - The default for jointplot.
kde - Uses a kernel density estimate on the joint axes (the main box in the plots above).
hist - Uses a histogram on the joint axes.
hex - A hexbin plot is the bivariate analog of a histogram: it shows the number of observations that fall within hexagonal bins. It works well with large datasets.
reg - Plots the data along with a linear regression fit. The line across the graph is the 'line of best fit', which we shall learn more about in the Stats and ML module. Direct function: regplot().
resid - Plots the residuals of the linear regression model, which we shall learn about in Stats and ML. The dotted line through 0 on the y axis corresponds to the line of best fit, with the residuals on either side. Direct function: residplot().
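To make the 'reg' and 'resid' kinds more concrete, the sketch below fits a straight line with NumPy and computes the residuals by hand. The five data points are invented, and np.polyfit is used here as a simple stand-in for the fit; it is not a claim about how seaborn computes it internally.

```python
import numpy as np

# Five hypothetical (x, y) observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares straight line: y is approximately slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

fitted = slope * x + intercept   # points on the line of best fit ('reg')
residuals = y - fitted           # vertical distances from the line ('resid')

print(slope, intercept)
print(residuals)
```

A residual plot puts `residuals` on the y axis, so the fitted line itself becomes the horizontal line at 0.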
sns.jointplot(data = df_HR, x = 'YearsAtCompany', y = 'Age', height = 8, ratio = 3, kind = 'kde')
plt.show()
sns.jointplot(data = df_HR, x = 'Age', y = 'MonthlyIncome',kind='scatter')
plt.show()
sns.jointplot(data = df_HR, x = 'Age', y = 'MonthlyIncome',kind='hex',color='red') # palette only applies when hue is set
plt.show()
sns.jointplot(data = df_HR, x = 'YearsAtCompany', y = 'Age', height = 8, ratio = 3, kind = 'hist')
plt.show()
sns.jointplot(data = df_HR, x = 'YearsAtCompany', y = 'Age', height = 8, ratio = 3, kind = 'hex')
plt.show()
sns.jointplot(data = df_HR, x = 'YearsAtCompany', y = 'Age', height = 8, ratio = 3, kind = 'reg')
plt.show()
sns.jointplot(data = df_HR, x = 'YearsAtCompany', y = 'Age', height = 8, ratio = 3, kind = 'resid')
plt.show()
Other maps in Seaborn¶
df_HR_num.describe()
| | Age | DailyRate | DistanceFromHome | Education | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | MonthlyRate | NumCompaniesWorked | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 | 2940.000000 |
| mean | 36.923810 | 802.485714 | 9.192517 | 2.912925 | 1470.500000 | 2.721769 | 65.891156 | 2.729932 | 2.063946 | 2.728571 | 6502.931293 | 14313.103401 | 2.693197 | 15.209524 | 3.153741 | 2.712245 | 0.793878 | 11.279592 | 2.799320 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.133819 | 403.440447 | 8.105485 | 1.023991 | 848.849221 | 1.092896 | 20.325969 | 0.711440 | 1.106752 | 1.102658 | 4707.155770 | 7116.575021 | 2.497584 | 3.659315 | 0.360762 | 1.081025 | 0.851932 | 7.779458 | 1.289051 | 0.706356 | 6.125483 | 3.622521 | 3.221882 | 3.567529 |
| min | 18.000000 | 102.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 30.000000 | 1.000000 | 1.000000 | 1.000000 | 1009.000000 | 2094.000000 | 0.000000 | 11.000000 | 3.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 30.000000 | 465.000000 | 2.000000 | 2.000000 | 735.750000 | 2.000000 | 48.000000 | 2.000000 | 1.000000 | 2.000000 | 2911.000000 | 8045.000000 | 1.000000 | 12.000000 | 3.000000 | 2.000000 | 0.000000 | 6.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 2.000000 |
| 50% | 36.000000 | 802.000000 | 7.000000 | 3.000000 | 1470.500000 | 3.000000 | 66.000000 | 3.000000 | 2.000000 | 3.000000 | 4919.000000 | 14235.500000 | 2.000000 | 14.000000 | 3.000000 | 3.000000 | 1.000000 | 10.000000 | 3.000000 | 3.000000 | 5.000000 | 3.000000 | 1.000000 | 3.000000 |
| 75% | 43.000000 | 1157.000000 | 14.000000 | 4.000000 | 2205.250000 | 4.000000 | 84.000000 | 3.000000 | 3.000000 | 4.000000 | 8380.000000 | 20462.000000 | 4.000000 | 18.000000 | 3.000000 | 4.000000 | 1.000000 | 15.000000 | 3.000000 | 3.000000 | 9.000000 | 7.000000 | 3.000000 | 7.000000 |
| max | 60.000000 | 1499.000000 | 29.000000 | 5.000000 | 2940.000000 | 4.000000 | 100.000000 | 4.000000 | 5.000000 | 4.000000 | 19999.000000 | 26999.000000 | 9.000000 | 25.000000 | 4.000000 | 4.000000 | 3.000000 | 40.000000 | 6.000000 | 4.000000 | 40.000000 | 18.000000 | 15.000000 | 17.000000 |
Heatmaps¶
Heatmaps are graphical representations of data that use color-coding to show different values. Usually, they are used for values that lie within a fixed scale, where the change in shade of a single color makes it easy to identify the higher and lower values.
Of course, we can also use a multi-color cmap to represent the data, but it usually isn't as clear to analyze.
df_HR_num.corr()
| | Age | DailyRate | DistanceFromHome | Education | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | MonthlyRate | NumCompaniesWorked | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1.000000 | 0.010661 | -0.001686 | 0.208034 | -0.005175 | 0.010146 | 0.024287 | 0.029820 | 0.509604 | -0.004892 | 0.497855 | 0.028051 | 0.299635 | 0.003634 | 0.001904 | 0.053535 | 0.037510 | 0.680381 | -0.019621 | -0.021490 | 0.311309 | 0.212901 | 0.216513 | 0.202089 |
| DailyRate | 0.010661 | 1.000000 | -0.004985 | -0.016806 | -0.025742 | 0.018355 | 0.023381 | 0.046135 | 0.002966 | 0.030571 | 0.007707 | -0.032182 | 0.038153 | 0.022704 | 0.000473 | 0.007846 | 0.042143 | 0.014515 | 0.002453 | -0.037848 | -0.034055 | 0.009932 | -0.033229 | -0.026363 |
| DistanceFromHome | -0.001686 | -0.004985 | 1.000000 | 0.021042 | 0.016464 | -0.016075 | 0.031131 | 0.008783 | 0.005303 | -0.003669 | -0.017014 | 0.027473 | -0.029251 | 0.040235 | 0.027110 | 0.006557 | 0.044872 | 0.004628 | -0.036942 | -0.026556 | 0.009508 | 0.018845 | 0.010029 | 0.014406 |
| Education | 0.208034 | -0.016806 | 0.021042 | 1.000000 | 0.020950 | -0.027128 | 0.016775 | 0.042438 | 0.101589 | -0.011296 | 0.094961 | -0.026084 | 0.126317 | -0.011111 | -0.024539 | -0.009118 | 0.018422 | 0.148280 | -0.025100 | 0.009819 | 0.069114 | 0.060236 | 0.054254 | 0.069065 |
| EmployeeNumber | -0.005175 | -0.025742 | 0.016464 | 0.020950 | 1.000000 | 0.008712 | 0.017377 | -0.003552 | -0.009020 | -0.022970 | -0.007188 | 0.006177 | -0.000345 | -0.006685 | -0.010338 | -0.034827 | 0.031226 | -0.007047 | 0.011953 | 0.005370 | -0.005779 | -0.004427 | -0.004575 | -0.004716 |
| EnvironmentSatisfaction | 0.010146 | 0.018355 | -0.016075 | -0.027128 | 0.008712 | 1.000000 | -0.049857 | -0.008278 | 0.001212 | -0.006784 | -0.006259 | 0.037600 | 0.012594 | -0.031701 | -0.029548 | 0.007665 | 0.003432 | -0.002693 | -0.019359 | 0.027627 | 0.001458 | 0.018007 | 0.016194 | -0.004999 |
| HourlyRate | 0.024287 | 0.023381 | 0.031131 | 0.016775 | 0.017377 | -0.049857 | 1.000000 | 0.042861 | -0.027853 | -0.071335 | -0.015794 | -0.015297 | 0.022157 | -0.009062 | -0.002172 | 0.001330 | 0.050263 | -0.002334 | -0.008548 | -0.004607 | -0.019582 | -0.024106 | -0.026716 | -0.020123 |
| JobInvolvement | 0.029820 | 0.046135 | 0.008783 | 0.042438 | -0.003552 | -0.008278 | 0.042861 | 1.000000 | -0.012630 | -0.021476 | -0.015271 | -0.016322 | 0.015012 | -0.017205 | -0.029071 | 0.034297 | 0.021523 | -0.005533 | -0.015338 | -0.014617 | -0.021355 | 0.008717 | -0.024184 | 0.025976 |
| JobLevel | 0.509604 | 0.002966 | 0.005303 | 0.101589 | -0.009020 | 0.001212 | -0.027853 | -0.012630 | 1.000000 | -0.001944 | 0.950300 | 0.039563 | 0.142501 | -0.034730 | -0.021222 | 0.021642 | 0.013984 | 0.782208 | -0.018191 | 0.037818 | 0.534739 | 0.389447 | 0.353885 | 0.375281 |
| JobSatisfaction | -0.004892 | 0.030571 | -0.003669 | -0.011296 | -0.022970 | -0.006784 | -0.071335 | -0.021476 | -0.001944 | 1.000000 | -0.007157 | 0.000644 | -0.055699 | 0.020002 | 0.002297 | -0.012454 | 0.010690 | -0.020185 | -0.005779 | -0.019459 | -0.003803 | -0.002305 | -0.018214 | -0.027656 |
| MonthlyIncome | 0.497855 | 0.007707 | -0.017014 | 0.094961 | -0.007188 | -0.006259 | -0.015794 | -0.015271 | 0.950300 | -0.007157 | 1.000000 | 0.034814 | 0.149515 | -0.027269 | -0.017120 | 0.025873 | 0.005408 | 0.772893 | -0.021736 | 0.030683 | 0.514285 | 0.363818 | 0.344978 | 0.344079 |
| MonthlyRate | 0.028051 | -0.032182 | 0.027473 | -0.026084 | 0.006177 | 0.037600 | -0.015297 | -0.016322 | 0.039563 | 0.000644 | 0.034814 | 1.000000 | 0.017521 | -0.006429 | -0.009811 | -0.004085 | -0.034323 | 0.026442 | 0.001467 | 0.007963 | -0.023655 | -0.012815 | 0.001567 | -0.036746 |
| NumCompaniesWorked | 0.299635 | 0.038153 | -0.029251 | 0.126317 | -0.000345 | 0.012594 | 0.022157 | 0.015012 | 0.142501 | -0.055699 | 0.149515 | 0.017521 | 1.000000 | -0.010238 | -0.014095 | 0.052733 | 0.030075 | 0.237639 | -0.066054 | -0.008366 | -0.118421 | -0.090754 | -0.036814 | -0.110319 |
| PercentSalaryHike | 0.003634 | 0.022704 | 0.040235 | -0.011111 | -0.006685 | -0.031701 | -0.009062 | -0.017205 | -0.034730 | 0.020002 | -0.027269 | -0.006429 | -0.010238 | 1.000000 | 0.773550 | -0.040490 | 0.007528 | -0.020608 | -0.005221 | -0.003280 | -0.035991 | -0.001520 | -0.022154 | -0.011985 |
| PerformanceRating | 0.001904 | 0.000473 | 0.027110 | -0.024539 | -0.010338 | -0.029548 | -0.002172 | -0.029071 | -0.021222 | 0.002297 | -0.017120 | -0.009811 | -0.014095 | 0.773550 | 1.000000 | -0.031351 | 0.003506 | 0.006744 | -0.015579 | 0.002572 | 0.003435 | 0.034986 | 0.017896 | 0.022827 |
| RelationshipSatisfaction | 0.053535 | 0.007846 | 0.006557 | -0.009118 | -0.034827 | 0.007665 | 0.001330 | 0.034297 | 0.021642 | -0.012454 | 0.025873 | -0.004085 | 0.052733 | -0.040490 | -0.031351 | 1.000000 | -0.045952 | 0.024054 | 0.002497 | 0.019604 | 0.019367 | -0.015123 | 0.033493 | -0.000867 |
| StockOptionLevel | 0.037510 | 0.042143 | 0.044872 | 0.018422 | 0.031226 | 0.003432 | 0.050263 | 0.021523 | 0.013984 | 0.010690 | 0.005408 | -0.034323 | 0.030075 | 0.007528 | 0.003506 | -0.045952 | 1.000000 | 0.010136 | 0.011274 | 0.004129 | 0.015058 | 0.050818 | 0.014352 | 0.024698 |
| TotalWorkingYears | 0.680381 | 0.014515 | 0.004628 | 0.148280 | -0.007047 | -0.002693 | -0.002334 | -0.005533 | 0.782208 | -0.020185 | 0.772893 | 0.026442 | 0.237639 | -0.020608 | 0.006744 | 0.024054 | 0.010136 | 1.000000 | -0.035662 | 0.001008 | 0.628133 | 0.460365 | 0.404858 | 0.459188 |
| TrainingTimesLastYear | -0.019621 | 0.002453 | -0.036942 | -0.025100 | 0.011953 | -0.019359 | -0.008548 | -0.015338 | -0.018191 | -0.005779 | -0.021736 | 0.001467 | -0.066054 | -0.005221 | -0.015579 | 0.002497 | 0.011274 | -0.035662 | 1.000000 | 0.028072 | 0.003569 | -0.005738 | -0.002067 | -0.004096 |
| WorkLifeBalance | -0.021490 | -0.037848 | -0.026556 | 0.009819 | 0.005370 | 0.027627 | -0.004607 | -0.014617 | 0.037818 | -0.019459 | 0.030683 | 0.007963 | -0.008366 | -0.003280 | 0.002572 | 0.019604 | 0.004129 | 0.001008 | 0.028072 | 1.000000 | 0.012089 | 0.049856 | 0.008941 | 0.002759 |
| YearsAtCompany | 0.311309 | -0.034055 | 0.009508 | 0.069114 | -0.005779 | 0.001458 | -0.019582 | -0.021355 | 0.534739 | -0.003803 | 0.514285 | -0.023655 | -0.118421 | -0.035991 | 0.003435 | 0.019367 | 0.015058 | 0.628133 | 0.003569 | 0.012089 | 1.000000 | 0.758754 | 0.618409 | 0.769212 |
| YearsInCurrentRole | 0.212901 | 0.009932 | 0.018845 | 0.060236 | -0.004427 | 0.018007 | -0.024106 | 0.008717 | 0.389447 | -0.002305 | 0.363818 | -0.012815 | -0.090754 | -0.001520 | 0.034986 | -0.015123 | 0.050818 | 0.460365 | -0.005738 | 0.049856 | 0.758754 | 1.000000 | 0.548056 | 0.714365 |
| YearsSinceLastPromotion | 0.216513 | -0.033229 | 0.010029 | 0.054254 | -0.004575 | 0.016194 | -0.026716 | -0.024184 | 0.353885 | -0.018214 | 0.344978 | 0.001567 | -0.036814 | -0.022154 | 0.017896 | 0.033493 | 0.014352 | 0.404858 | -0.002067 | 0.008941 | 0.618409 | 0.548056 | 1.000000 | 0.510224 |
| YearsWithCurrManager | 0.202089 | -0.026363 | 0.014406 | 0.069065 | -0.004716 | -0.004999 | -0.020123 | 0.025976 | 0.375281 | -0.027656 | 0.344079 | -0.036746 | -0.110319 | -0.011985 | 0.022827 | -0.000867 | 0.024698 | 0.459188 | -0.004096 | 0.002759 | 0.769212 | 0.714365 | 0.510224 | 1.000000 |
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize = (20, 10))
sns.heatmap(df_HR_num.corr(), cmap = sns.color_palette('rocket', as_cmap=True),annot=True)
plt.show()
# show annot=True
plt.figure(figsize=(20,10))
sns.heatmap(df_HR_num.corr(),annot=True,cmap='rocket')
plt.show()
Above, we computed the correlation of every column with every other column (-1 is a perfect inverse correlation, 0 is no correlation, and 1 is a perfect positive correlation), so the values lie between -1 and 1. With the rocket colormap used here, lighter cells mark stronger positive correlations, and darker cells mark values near zero or below.
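Since correlations always lie between -1 and 1, one option is to pin the color scale to that range so that colors are comparable across heatmaps. The sketch below does this on synthetic data with the matplotlib 'coolwarm' colormap; the `toy` columns are invented, and the choice of colormap is only one possibility.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script; drop this in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic columns with positive, negative and near-zero correlations (hypothetical data)
rng = np.random.default_rng(3)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.5 * rng.normal(size=200),    # positively related to a
    "c": -a + 0.5 * rng.normal(size=200),   # negatively related to a
    "d": rng.normal(size=200),              # unrelated noise
})
corr = toy.corr()

fig, ax = plt.subplots(figsize=(5, 4))
# vmin/vmax fix the color scale to the full correlation range,
# so the middle of the colormap sits at 0
sns.heatmap(corr, vmin=-1, vmax=1, cmap="coolwarm", annot=True, fmt=".2f", ax=ax)
```

With a symmetric range and a two-sided colormap, positive and negative correlations get visibly different colors rather than different shades of one color.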
Boxen Plot using Seaborn¶
Another plot that we can use to show a distribution is the boxen plot. Boxen plots were originally named letter-value plots because they show a large number of quantiles of a variable, which are also called letter values. By plotting more quantiles than a box plot does, a boxen plot gives more insight into the shape of the distribution, particularly in the tails. Otherwise, it is similar to a box plot.
We can draw these plots using catplot() with kind = 'boxen' or directly by calling boxenplot().
sns.catplot(data = df_HR, x = 'DailyRate', kind = 'boxen')
<seaborn.axisgrid.FacetGrid at 0x1cfdf5b40b0>
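The letter values behind a boxen plot can be sketched with NumPy: each level halves the tail probability of the previous one. The helper `letter_values` and its `depth` parameter are hypothetical names used for illustration; seaborn chooses the number of levels on its own, and its exact quantile conventions may differ.

```python
import numpy as np

def letter_values(values, depth=3):
    """Median plus (lower, upper) quantile pairs for up to four letter-value levels."""
    values = np.asarray(values, dtype=float)
    names = ["fourths", "eighths", "sixteenths", "thirty-seconds"]
    result = {"median": float(np.quantile(values, 0.5))}
    for k in range(depth):
        tail = 0.5 ** (k + 2)  # 1/4, then 1/8, then 1/16, ...
        result[names[k]] = (float(np.quantile(values, tail)),
                            float(np.quantile(values, 1 - tail)))
    return result

# The integers 1..101 make the quantiles easy to check by hand
lv = letter_values(np.arange(1, 102))
for name, value in lv.items():
    print(name, value)
```

The fourths are the edges of an ordinary box plot; each further pair corresponds to one of the narrower boxes that a boxen plot adds toward the tails.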
Visualizing Pairwise Relationships in a Dataset¶
We can also plot multiple bivariate distributions in a dataset by using the pairplot() function of the seaborn library. This shows the relationship between each pair of numeric columns in the dataset. It also draws the univariate distribution of each variable on the diagonal axes. Let’s see how it looks.
df_HR_num.columns
Index(['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeNumber',
'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel',
'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
'YearsSinceLastPromotion', 'YearsWithCurrManager'],
dtype='object')
df_HR_num2 = df_HR_num[['Age', 'HourlyRate', 'JobSatisfaction', 'NumCompaniesWorked', 'PerformanceRating', 'StockOptionLevel',\
'TotalWorkingYears', 'TrainingTimesLastYear','WorkLifeBalance', 'YearsWithCurrManager']].copy()
df_HR_num2.head()
| | Age | HourlyRate | JobSatisfaction | NumCompaniesWorked | PerformanceRating | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 94 | 4 | 8 | 3 | 0 | 8 | 0 | 1 | 5 |
| 1 | 49 | 61 | 2 | 1 | 4 | 1 | 10 | 3 | 3 | 7 |
| 2 | 37 | 92 | 3 | 6 | 3 | 0 | 7 | 3 | 3 | 0 |
| 3 | 33 | 56 | 3 | 1 | 3 | 0 | 8 | 3 | 3 | 0 |
| 4 | 27 | 40 | 2 | 9 | 3 | 1 | 6 | 3 | 3 | 2 |
sns.pairplot(df_HR_num2, height = 2, aspect = 1, kind = 'reg', diag_kind = 'hist')
plt.savefig('Pairplot.jpg')
plt.show()
sns.pairplot(df_HR_num)
<seaborn.axisgrid.PairGrid at 0x1cfdf1e2180>
The kind parameter (bivariate plots) takes the following values: 'scatter', 'kde', 'hist', 'reg'. The default is 'scatter'.
The diag_kind parameter controls the univariate distribution shown on the diagonal and takes the following values: 'auto', 'hist', 'kde', None. The default is 'auto'.
df_HR_num2.plot(kind="box", subplots=True, layout=(7,5),figsize=(20,20))
Age Axes(0.125,0.786098;0.133621x0.0939024) HourlyRate Axes(0.285345,0.786098;0.133621x0.0939024) JobSatisfaction Axes(0.44569,0.786098;0.133621x0.0939024) NumCompaniesWorked Axes(0.606034,0.786098;0.133621x0.0939024) PerformanceRating Axes(0.766379,0.786098;0.133621x0.0939024) StockOptionLevel Axes(0.125,0.673415;0.133621x0.0939024) TotalWorkingYears Axes(0.285345,0.673415;0.133621x0.0939024) TrainingTimesLastYear Axes(0.44569,0.673415;0.133621x0.0939024) WorkLifeBalance Axes(0.606034,0.673415;0.133621x0.0939024) YearsWithCurrManager Axes(0.766379,0.673415;0.133621x0.0939024) dtype: object
print(dir(pd))
['ArrowDtype', 'BooleanDtype', 'Categorical', 'CategoricalDtype', 'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex', 'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Flags', 'Float32Dtype', 'Float64Dtype', 'Grouper', 'HDFStore', 'Index', 'IndexSlice', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int8Dtype', 'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NA', 'NaT', 'NamedAgg', 'Period', 'PeriodDtype', 'PeriodIndex', 'RangeIndex', 'Series', 'SparseDtype', 'StringDtype', 'Timedelta', 'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype', 'UInt8Dtype', '__all__', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_built_with_meson', '_config', '_is_numpy_dev', '_libs', '_pandas_datetime_CAPI', '_pandas_parser_CAPI', '_testing', '_typing', '_version_meson', 'annotations', 'api', 'array', 'arrays', 'bdate_range', 'compat', 'concat', 'core', 'crosstab', 'cut', 'date_range', 'describe_option', 'errors', 'eval', 'factorize', 'from_dummies', 'get_dummies', 'get_option', 'infer_freq', 'interval_range', 'io', 'isna', 'isnull', 'json_normalize', 'lreshape', 'melt', 'merge', 'merge_asof', 'merge_ordered', 'notna', 'notnull', 'offsets', 'option_context', 'options', 'pandas', 'period_range', 'pivot', 'pivot_table', 'plotting', 'qcut', 'read_clipboard', 'read_csv', 'read_excel', 'read_feather', 'read_fwf', 'read_gbq', 'read_hdf', 'read_html', 'read_json', 'read_orc', 'read_parquet', 'read_pickle', 'read_sas', 'read_spss', 'read_sql', 'read_sql_query', 'read_sql_table', 'read_stata', 'read_table', 'read_xml', 'reset_option', 'set_eng_float_format', 'set_option', 'show_versions', 'test', 'testing', 'timedelta_range', 'to_datetime', 'to_numeric', 'to_pickle', 'to_timedelta', 'tseries', 'unique', 'util', 'value_counts', 'wide_to_long']
Finally, the best plot of them ALL.¶
import seaborn as sns
sns.dogplot()
The creators of seaborn put in an Easter egg: call sns.dogplot() and seaborn will display a randomly chosen picture of an adorable dog!
print(dir(sns))
['FacetGrid', 'JointGrid', 'PairGrid', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_base', '_compat', '_core', '_docstrings', '_orig_rc_params', '_statistics', '_stats', 'algorithms', 'axes_style', 'axisgrid', 'barplot', 'blend_palette', 'boxenplot', 'boxplot', 'categorical', 'catplot', 'choose_colorbrewer_palette', 'choose_cubehelix_palette', 'choose_dark_palette', 'choose_diverging_palette', 'choose_light_palette', 'clustermap', 'cm', 'color_palette', 'colors', 'countplot', 'crayon_palette', 'crayons', 'cubehelix_palette', 'dark_palette', 'desaturate', 'despine', 'displot', 'distplot', 'distributions', 'diverging_palette', 'dogplot', 'ecdfplot', 'external', 'get_data_home', 'get_dataset_names', 'heatmap', 'histplot', 'hls_palette', 'husl_palette', 'jointplot', 'kdeplot', 'light_palette', 'lineplot', 'lmplot', 'load_dataset', 'matrix', 'miscplot', 'move_legend', 'mpl', 'mpl_palette', 'pairplot', 'palettes', 'palplot', 'plotting_context', 'pointplot', 'rcmod', 'regplot', 'regression', 'relational', 'relplot', 'reset_defaults', 'reset_orig', 'residplot', 'rugplot', 'saturate', 'scatterplot', 'set', 'set_color_codes', 'set_context', 'set_hls_values', 'set_palette', 'set_style', 'set_theme', 'stripplot', 'swarmplot', 'utils', 'violinplot', 'widgets', 'xkcd_palette', 'xkcd_rgb']
print(dir(df_HR))
['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel', 'T', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__arrow_c_stream__', '__bool__', '__class__', '__contains__', '__copy__', '__dataframe__', '__dataframe_consortium_standard__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pandas_priority__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '__xor__', '_accessors', '_accum_func', '_agg_examples_doc', '_agg_see_also_doc', '_align_for_op', 
'_align_frame', '_align_series', '_append', '_arith_method', '_arith_method_with_reindex', '_as_manager', '_attrs', '_box_col_values', '_can_fast_transpose', '_check_inplace_and_allows_duplicate_labels', '_check_is_chained_assignment_possible', '_check_label_or_level_ambiguity', '_check_setitem_copy', '_clear_item_cache', '_clip_with_one_bound', '_clip_with_scalar', '_cmp_method', '_combine_frame', '_consolidate', '_consolidate_inplace', '_construct_axes_dict', '_construct_result', '_constructor', '_constructor_from_mgr', '_constructor_sliced', '_constructor_sliced_from_mgr', '_create_data_for_split_and_tight_to_dict', '_data', '_deprecate_downcast', '_dir_additions', '_dir_deletions', '_dispatch_frame_op', '_drop_axis', '_drop_labels_or_levels', '_ensure_valid_index', '_find_valid_index', '_flags', '_flex_arith_method', '_flex_cmp_method', '_from_arrays', '_from_mgr', '_get_agg_axis', '_get_axis', '_get_axis_name', '_get_axis_number', '_get_axis_resolvers', '_get_block_manager_axis', '_get_bool_data', '_get_cleaned_column_resolvers', '_get_column_array', '_get_index_resolvers', '_get_item_cache', '_get_label_or_level_values', '_get_numeric_data', '_get_value', '_get_values_for_csv', '_getitem_bool_array', '_getitem_multilevel', '_getitem_nocopy', '_getitem_slice', '_gotitem', '_hidden_attrs', '_indexed_same', '_info_axis', '_info_axis_name', '_info_axis_number', '_info_repr', '_init_mgr', '_inplace_method', '_internal_names', '_internal_names_set', '_is_copy', '_is_homogeneous_type', '_is_label_or_level_reference', '_is_label_reference', '_is_level_reference', '_is_mixed_type', '_is_view', '_is_view_after_cow_rules', '_iset_item', '_iset_item_mgr', '_iset_not_inplace', '_item_cache', '_iter_column_arrays', '_ixs', '_logical_func', '_logical_method', '_maybe_align_series_as_frame', '_maybe_cache_changed', '_maybe_update_cacher', '_metadata', '_mgr', '_min_count_stat_function', '_needs_reindex_multi', '_pad_or_backfill', '_protect_consolidate', '_reduce', 
'_reduce_axis1', '_reindex_axes', '_reindex_multi', '_reindex_with_indexers', '_rename', '_replace_columnwise', '_repr_data_resource_', '_repr_fits_horizontal_', '_repr_fits_vertical_', '_repr_html_', '_repr_latex_', '_reset_cache', '_reset_cacher', '_sanitize_column', '_series', '_set_axis', '_set_axis_name', '_set_axis_nocheck', '_set_is_copy', '_set_item', '_set_item_frame_value', '_set_item_mgr', '_set_value', '_setitem_array', '_setitem_frame', '_setitem_slice', '_shift_with_freq', '_should_reindex_frame_op', '_slice', '_stat_function', '_stat_function_ddof', '_take_with_is_copy', '_to_dict_of_blocks', '_to_latex_via_styler', '_typ', '_update_inplace', '_validate_dtype', '_values', '_where', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'apply', 'applymap', 'asfreq', 'asof', 'assign', 'astype', 'at', 'at_time', 'attrs', 'axes', 'backfill', 'between_time', 'bfill', 'bool', 'boxplot', 'clip', 'columns', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'corrwith', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dtypes', 'duplicated', 'empty', 'eq', 'equals', 'eval', 'ewm', 'expanding', 'explode', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'flags', 'floordiv', 'from_dict', 'from_records', 'ge', 'get', 'groupby', 'gt', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'info', 'insert', 'interpolate', 'isetitem', 'isin', 'isna', 'isnull', 'items', 'iterrows', 'itertuples', 'join', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lt', 'map', 'mask', 'max', 'mean', 'median', 'melt', 'memory_usage', 'merge', 'min', 'mod', 'mode', 'mul', 'multiply', 'ndim', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'pivot', 'pivot_table', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'query', 'radd', 
'rank', 'rdiv', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'select_dtypes', 'sem', 'set_axis', 'set_flags', 'set_index', 'shape', 'shift', 'size', 'skew', 'sort_index', 'sort_values', 'squeeze', 'stack', 'std', 'style', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_feather', 'to_gbq', 'to_hdf', 'to_html', 'to_json', 'to_latex', 'to_markdown', 'to_numpy', 'to_orc', 'to_parquet', 'to_period', 'to_pickle', 'to_records', 'to_sql', 'to_stata', 'to_string', 'to_timestamp', 'to_xarray', 'to_xml', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unstack', 'update', 'value_counts', 'values', 'var', 'where', 'xs']
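The listing above looks like the result of calling `dir()` on the DataFrame, which mixes the dataset's column names (`Age`, `Attrition`, ...) with every pandas attribute and method. A minimal sketch of how the two could be separated is shown below. It assumes a small stand-in DataFrame named `df` that reuses three column names from the listing with made-up placeholder values; it is not the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in for the HR attrition data: only three of the
# column names from the listing are reused, and the values are placeholders.
df = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "MonthlyIncome": [5993, 5130, 2090],
})

# dir() returns column names (when they are valid identifiers) together
# with all attributes and methods of the DataFrame object.
all_names = dir(df)

# df.columns holds only the dataset's own column labels.
columns = df.columns.tolist()

# Public pandas attributes/methods: no leading underscore, not a column.
public_non_columns = [
    name for name in all_names
    if not name.startswith("_") and name not in columns
]

print(columns)
print(len(all_names), len(public_non_columns))
```

When the goal is just to see which variables are available for plotting, `df.columns` (or `df.info()`) is usually easier to read than the full `dir()` output.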